Time Series Forecasting¶

This is based on the Kaggle competition available at https://www.kaggle.com/competitions/store-sales-time-series-forecasting. The goal is to build a model that accurately predicts the unit sales of multiple product families at the different stores of a corporation.

In [3]:
# Download data
# Needs a kaggle.json file in C:\Users\<Username>\.kaggle

# import kaggle
# ! kaggle competitions download -c store-sales-time-series-forecasting -q
# ! unzip store-sales-time-series-forecasting
Archive:  store-sales-time-series-forecasting.zip
  inflating: holidays_events.csv     
  inflating: oil.csv                 
  inflating: sample_submission.csv   
  inflating: stores.csv              
  inflating: test.csv                
  inflating: train.csv               
  inflating: transactions.csv        
In [ ]:
!conda install -c plotly plotly-orca
In [4]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)

import fbprophet
import pmdarima as pm

%matplotlib inline

First, let's import the train and test data and have a look.

In [5]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head(10)
Out[5]:
id date store_nbr family sales onpromotion
0 0 2013-01-01 1 AUTOMOTIVE 0.0 0
1 1 2013-01-01 1 BABY CARE 0.0 0
2 2 2013-01-01 1 BEAUTY 0.0 0
3 3 2013-01-01 1 BEVERAGES 0.0 0
4 4 2013-01-01 1 BOOKS 0.0 0
5 5 2013-01-01 1 BREAD/BAKERY 0.0 0
6 6 2013-01-01 1 CELEBRATION 0.0 0
7 7 2013-01-01 1 CLEANING 0.0 0
8 8 2013-01-01 1 DAIRY 0.0 0
9 9 2013-01-01 1 DELI 0.0 0

Here, we have the following features:

  1. id: This is an id for the sale.
  2. date: This is the date on which the sale occurred.
  3. store_nbr: This is an id for the store at which the sale occurred.
  4. family: This refers to the family of products in which the sale occurred.
  5. sales: This is the total sales for that product family at that store on that day.
  6. onpromotion: This is the number of items in that product family that were on promotion at that store on that day.

Let us check the data types, the missing values, and the unique values, and see whether any preprocessing needs to be done.

In [8]:
print('Column datatypes in train:')
print(train.dtypes,'\n')

print('Missing values in train:')
print(train.isnull().sum(),'\n')

print('Missing values in test:')
print(test.isnull().sum(),'\n')

print('Number of unique stores:',len(train['store_nbr'].unique()),'\n')

print('Unique product families:',train['family'].unique(),'\n')

print('Number of unique product families:',len(train['family'].unique()),'\n')

print('Date range of train:',train['date'].iloc[0],'to',train['date'].iloc[-1])
print('Date range of test:',test['date'].iloc[0],'to',test['date'].iloc[-1])
Column datatypes in train:
id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object 

Missing values in train:
id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64 

Missing values in test:
id             0
date           0
store_nbr      0
family         0
onpromotion    0
dtype: int64 

Number of unique stores: 54 

Unique product families: ['AUTOMOTIVE' 'BABY CARE' 'BEAUTY' 'BEVERAGES' 'BOOKS' 'BREAD/BAKERY'
 'CELEBRATION' 'CLEANING' 'DAIRY' 'DELI' 'EGGS' 'FROZEN FOODS' 'GROCERY I'
 'GROCERY II' 'HARDWARE' 'HOME AND KITCHEN I' 'HOME AND KITCHEN II'
 'HOME APPLIANCES' 'HOME CARE' 'LADIESWEAR' 'LAWN AND GARDEN' 'LINGERIE'
 'LIQUOR,WINE,BEER' 'MAGAZINES' 'MEATS' 'PERSONAL CARE' 'PET SUPPLIES'
 'PLAYERS AND ELECTRONICS' 'POULTRY' 'PREPARED FOODS' 'PRODUCE'
 'SCHOOL AND OFFICE SUPPLIES' 'SEAFOOD'] 

Number of unique product families: 33 

Date range of train: 2013-01-01 to 2017-08-15
Date range of test: 2017-08-16 to 2017-08-31

The following observations can be made:

  1. There are no missing values in either train or test datasets.
  2. The date values are strings. We can parse them into datetime objects. It would also be helpful to have these dates as the index for easy manipulation.
  3. There are 54 unique stores.
  4. There are 33 unique product families.
  5. The data in train starts from Jan 1, 2013 and continues till Aug 15, 2017. The data in test continues from Aug 16, 2017 to Aug 31, 2017. We do not have data from Sept to Dec for 2017.

Let us start with the transformation of the date column. We also do not need the id column for our analysis, so we can remove it. Note that the ids will be needed again at submission time, since we have to submit a predicted sales value for each sale id.
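Since the ids must accompany the predictions later, one possible pattern (a sketch with toy data; `toy_test`, `test_ids`, and the placeholder values are introduced here and are not the notebook's code) is to stash the ids before dropping the column:

```python
import pandas as pd

# Toy stand-in for the real test frame (values are illustrative only)
toy_test = pd.DataFrame({'id': [3000888, 3000889],
                         'date': ['2017-08-16', '2017-08-16'],
                         'onpromotion': [0, 0]})

# Stash the ids before dropping the column, so the submission
# file can be rebuilt later as (id, predicted sales)
test_ids = toy_test['id'].copy()
toy_test = toy_test.drop(columns='id')

# ... after forecasting, pair the saved ids with the predictions
predictions = [4.7, 5.2]  # placeholder forecast values
submission = pd.DataFrame({'id': test_ids, 'sales': predictions})
print(submission)
```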

In [9]:
for df in [train,test]:
    # Convert date column to date object
    df['date'] = pd.to_datetime(df['date'])
    
    # Remove id column
    df.drop('id', axis=1, inplace=True)

train.head(10)
Out[9]:
date store_nbr family sales onpromotion
0 2013-01-01 1 AUTOMOTIVE 0.0 0
1 2013-01-01 1 BABY CARE 0.0 0
2 2013-01-01 1 BEAUTY 0.0 0
3 2013-01-01 1 BEVERAGES 0.0 0
4 2013-01-01 1 BOOKS 0.0 0
5 2013-01-01 1 BREAD/BAKERY 0.0 0
6 2013-01-01 1 CELEBRATION 0.0 0
7 2013-01-01 1 CLEANING 0.0 0
8 2013-01-01 1 DAIRY 0.0 0
9 2013-01-01 1 DELI 0.0 0
In [10]:
train.dtypes
Out[10]:
date           datetime64[ns]
store_nbr               int64
family                 object
sales                 float64
onpromotion             int64
dtype: object

We have parsed the dates, and we have removed the id column. We can do some preliminary analysis of the data in train now.

Since we have data for each store, we can write a function that calculates the mean of a column over a given time frequency. We can also use interpolation to fill in values for days on which no data was recorded. We know such values cannot be zero, since days with no sales are explicitly recorded as 0.

In [6]:
def freq_resample(df, freq, col, key='date'):
    df_resampled = df.resample(freq, on=key)[col].mean().reset_index()
    df_resampled[col] = df_resampled[col].interpolate()
    
    return df_resampled
In [7]:
train_monthly = freq_resample(train, freq='M', col='sales')
display(train_monthly.head(10))
date sales
0 2013-01-31 186.952405
1 2013-02-28 193.581846
2 2013-03-31 206.880581
3 2013-04-30 205.639071
4 2013-05-31 209.943594
5 2013-06-30 218.655893
6 2013-07-31 203.783364
7 2013-08-31 212.479434
8 2013-09-30 220.593588
9 2013-10-31 213.164266
In [13]:
# Sales analysis by store number
df = train.groupby('store_nbr')['sales'].mean().reset_index().sort_values(by='sales', ascending=False)
fig = px.bar(df, x='store_nbr', y='sales', title='Averaged Sales by Store Number')
fig.show()
In [9]:
# Sales analysis by product family
df = train.groupby('family')['sales'].mean().reset_index().sort_values(by='sales', ascending=True)
fig = px.bar(df, x='sales', y='family', orientation='h', title='Averaged Sales by Product Family')
fig.show()

We can make the following observations:

  1. Sales are generally high for stores 44 to 51, with store 3 also among the top.
  2. Some of the highest selling product families are Grocery I, Beverages, Produce, Cleaning, etc. Some of the lowest are Books, Baby Care, Home Appliances, Hardware, etc.

Next, we should look into the stores data.

In [10]:
stores = pd.read_csv('stores.csv')
stores.head()
Out[10]:
store_nbr city state type cluster
0 1 Quito Pichincha D 13
1 2 Quito Pichincha D 13
2 3 Quito Pichincha D 8
3 4 Quito Pichincha D 9
4 5 Santo Domingo Santo Domingo de los Tsachilas D 4

We have the following features:

  1. store_nbr: This is the same id for each store as in train and test.
  2. city: This is the city in which the store is present.
  3. state: This is the state in which the store is present.
  4. type: This is the type of store. No information is available on what the store types represent.
  5. cluster: This is the clustering of similar stores.

We can add this information to the train and test datasets by joining on the store numbers.

In [11]:
train = train.merge(stores, on='store_nbr', how='left')
test = test.merge(stores, on='store_nbr', how='left')

train.head(10)
Out[11]:
date store_nbr family sales onpromotion city state type cluster
0 2013-01-01 1 AUTOMOTIVE 0.0 0 Quito Pichincha D 13
1 2013-01-01 1 BABY CARE 0.0 0 Quito Pichincha D 13
2 2013-01-01 1 BEAUTY 0.0 0 Quito Pichincha D 13
3 2013-01-01 1 BEVERAGES 0.0 0 Quito Pichincha D 13
4 2013-01-01 1 BOOKS 0.0 0 Quito Pichincha D 13
5 2013-01-01 1 BREAD/BAKERY 0.0 0 Quito Pichincha D 13
6 2013-01-01 1 CELEBRATION 0.0 0 Quito Pichincha D 13
7 2013-01-01 1 CLEANING 0.0 0 Quito Pichincha D 13
8 2013-01-01 1 DAIRY 0.0 0 Quito Pichincha D 13
9 2013-01-01 1 DELI 0.0 0 Quito Pichincha D 13

Next, we look at transactions data.

In [12]:
trans = pd.read_csv('transactions.csv')
trans.head(10)
Out[12]:
date store_nbr transactions
0 2013-01-01 25 770
1 2013-01-02 1 2111
2 2013-01-02 2 2358
3 2013-01-02 3 3487
4 2013-01-02 4 1922
5 2013-01-02 5 1903
6 2013-01-02 6 2143
7 2013-01-02 7 1874
8 2013-01-02 8 3250
9 2013-01-02 9 2940

The transactions data contains the number of transactions that occurred at a particular store on a particular date. This should be correlated with sales, but unlike sales, transactions are integers; they can be thought of as the number of invoices generated that day. We can parse the date column.

In [13]:
trans['date'] = pd.to_datetime(trans['date'])

trans.head(10)
Out[13]:
date store_nbr transactions
0 2013-01-01 25 770
1 2013-01-02 1 2111
2 2013-01-02 2 2358
3 2013-01-02 3 3487
4 2013-01-02 4 1922
5 2013-01-02 5 1903
6 2013-01-02 6 2143
7 2013-01-02 7 1874
8 2013-01-02 8 3250
9 2013-01-02 9 2940

We can look at the transactions through the dates for the different stores.

In [14]:
fig = px.line(trans.sort_values(["store_nbr", "date"]), x='date', y='transactions', color='store_nbr',\
              title='Transactions')
fig.show()

Double-clicking a store number in the legend isolates the transactions time series for that store. We can see some patterns in these plots.

  1. Transactions are high around the end of the year, especially around Christmas.
  2. The rest of the months are fairly similar.

To examine these trends more clearly, we can reduce the granularity by averaging transactions over all stores for each month.

In [15]:
df = freq_resample(trans, freq='M', col='transactions')
fig = px.line(df.sort_values('date'), x='date', y='transactions', title='Averaged transactions over all stores per month')
fig.show()

This plot shows the trends more clearly. There are clear peaks every December.
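The December peaks can also be quantified directly with a groupby on calendar month. A minimal sketch on synthetic data (with the real data one would group `trans` by `trans['date'].dt.month`; the toy series here just has an artificial December bump):

```python
import numpy as np
import pandas as pd

# Toy daily transactions with an artificial December bump
dates = pd.date_range('2013-01-01', '2014-12-31', freq='D')
rng = np.random.default_rng(0)
values = 2000 + rng.normal(0, 50, len(dates))
values[dates.month == 12] += 500  # simulated holiday peak

toy = pd.DataFrame({'date': dates, 'transactions': values})

# Mean transactions per calendar month, averaged across years
monthly_means = toy.groupby(toy['date'].dt.month)['transactions'].mean()
print(monthly_means.idxmax())  # 12, i.e. December has the highest mean
```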

Next, we can see the shopping patterns based on day of the week.

In [16]:
df = trans.copy()
df["year"] = df.date.dt.year
df["dayofweek"] = df.date.dt.dayofweek+1
df = df.groupby(["year", "dayofweek"])['transactions'].mean().reset_index()
fig = px.line(df, x="dayofweek", y="transactions" , color = "year", title = "Averaged transactions over day of the week for all stores")
fig.show()

It is clear that the majority of shopping occurs on the weekends, while Thursday is generally a slow day.

Next, we look at the oil data.

In [17]:
oil = pd.read_csv('oil.csv')
oil.head()
Out[17]:
date dcoilwtico
0 1/1/2013 NaN
1 1/2/2013 93.14
2 1/3/2013 92.97
3 1/4/2013 93.12
4 1/7/2013 93.20

The oil data contains the daily oil price. It is provided because Ecuador is oil-dependent, so its economic health is vulnerable to shocks in oil prices. Looking at the first row, there appear to be missing values in the data. Let us have a look.

In [18]:
oil['date'] = pd.to_datetime(oil['date'])
oil.isnull().sum()
Out[18]:
date           0
dcoilwtico    43
dtype: int64

There are 43 missing values. Since this is daily data, we can use interpolation to fill them in. The first value cannot be interpolated, so we fill it with the backfill method. Let us also rename the column to 'price'.

In [19]:
oil.rename(columns={'dcoilwtico':'price'}, inplace=True)
oil['price'] = oil['price'].interpolate()
print(oil.isnull().sum(),'\n')

oil['price'].fillna(method='bfill', inplace=True)
print(oil.isnull().sum(),'\n')
date     0
price    1
dtype: int64 

date     0
price    0
dtype: int64 

In [20]:
oil.head(10)
Out[20]:
date price
0 2013-01-01 93.14
1 2013-01-02 93.14
2 2013-01-03 92.97
3 2013-01-04 93.12
4 2013-01-07 93.20
5 2013-01-08 93.21
6 2013-01-09 93.08
7 2013-01-10 93.81
8 2013-01-11 93.60
9 2013-01-14 94.27

We can now have a look at the price trends.

In [21]:
fig = px.line(oil, x='date', y='price', title='Daily oil price')
fig.show()

It is clear that the oil price fell drastically, reaching a low in early 2016. We are told that Ecuador's economy is highly dependent on the oil price. We can check this claim by performing a correlation analysis between oil price and transactions or sales.

In [22]:
a = freq_resample(trans, freq='D', col='transactions')
b = freq_resample(train, freq='D', col='sales')
c = oil.rename(columns={'price':'oil_price'})

d = a.merge(b, on='date', how='inner')
d = c.merge(d, on='date', how='inner')

display(d.head())
date oil_price transactions sales
0 2013-01-01 93.14 770.000000 1.409438
1 2013-01-02 93.14 2026.413043 278.390807
2 2013-01-03 92.97 1706.608696 202.840197
3 2013-01-04 93.12 1706.391304 198.911154
4 2013-01-07 93.20 1643.413043 188.621100
In [23]:
d.isnull().sum()
Out[23]:
date            0
oil_price       0
transactions    0
sales           0
dtype: int64
In [24]:
fig = make_subplots(rows=3, cols=1)


fig.add_trace(go.Scatter(x=d['date'], y=d['oil_price'], name='Daily oil price', mode='lines'),
              row=1, col=1)
fig.add_trace(go.Scatter(x=d['date'], y=d['transactions'], name='Daily mean transactions', mode='lines'),
              row=2, col=1)
fig.add_trace(go.Scatter(x=d['date'], y=d['sales'], name='Daily mean sales', mode='lines'),
              row=3, col=1)

fig.update_layout(height=1000, width=1000)
fig.show()

While the correlation is not immediately apparent, we can draw some inferences from the peaks in the sales values: as the oil price fell, sales increased from 2016 onward. We can calculate a correlation coefficient to test this hypothesis.

In [25]:
d.corr(method='spearman')
Out[25]:
oil_price transactions sales
oil_price 1.000000 0.274351 -0.652512
transactions 0.274351 1.000000 0.134563
sales -0.652512 0.134563 1.000000

The Spearman correlation between oil price and transactions is about 0.27, while that between oil price and sales is about -0.65. Thus, oil price is indeed negatively correlated with store sales, but it has little relation with transactions. One interpretation is that as the oil price falls, goods become cheaper, leading to higher sales. Note also that transactions and sales are not highly correlated with each other.
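The notebook reports only the coefficients. As a rough significance check, a permutation test on the Spearman correlation can be sketched with pandas alone (synthetic stand-in data here; with the real merged frame `d` one would use `d['oil_price']` and `d['sales']`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-ins: sales trend up while oil price trends down
oil_price = pd.Series(np.linspace(110, 40, 200) + rng.normal(0, 5, 200))
sales = pd.Series(np.linspace(150, 400, 200) + rng.normal(0, 20, 200))

observed = oil_price.corr(sales, method='spearman')

# Null distribution: shuffle one series (reset the index so that
# .corr pairs values positionally, not by the old index labels)
perms = np.array([
    oil_price.corr(sales.sample(frac=1, random_state=i).reset_index(drop=True),
                   method='spearman')
    for i in range(500)
])
p_value = np.mean(np.abs(perms) >= np.abs(observed))
print(round(observed, 2), p_value)
```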

Before moving on to the holidays and events data, we can start analyzing the time series with what we have so far. We need a stationary time series before we can build classical forecasting models. A stationary time series has the following properties:

  1. Constant mean
  2. Constant variance
  3. Constant autocorrelation structure
  4. No seasonal or periodic component

For now, we can treat the weekly resampled sales data as the initial time series and work on forecasting based on its properties.
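Before any formal test, these properties can be eyeballed with rolling statistics. A minimal sketch on synthetic weekly data (names and numbers invented for illustration) shows a trending series failing the constant-mean check, while its first difference passes it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range('2013-01-06', periods=208, freq='W')

# A trending series violates the constant-mean property
trending = pd.Series(np.arange(208) * 2.0 + rng.normal(0, 5, 208), index=idx)
# First differencing removes the linear trend
differenced = trending.diff().dropna()

window = 26  # half a year of weekly data
roll = trending.rolling(window).mean().dropna()
roll_diff = differenced.rolling(window).mean().dropna()

drift = roll.iloc[-1] - roll.iloc[0]                 # large: the mean keeps growing
drift_diff = roll_diff.iloc[-1] - roll_diff.iloc[0]  # near zero after differencing
print(drift, drift_diff)
```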

In [26]:
df = freq_resample(train, freq='W', col='sales')
df.head()
Out[26]:
date sales
0 2013-01-06 206.843478
1 2013-01-13 190.285220
2 2013-01-20 189.835452
3 2013-01-27 182.152050
4 2013-02-03 198.564267

Let us also set the date column as the index, since that makes further calculations easier.

In [27]:
df.set_index('date', inplace=True)
df.head()
Out[27]:
sales
date
2013-01-06 206.843478
2013-01-13 190.285220
2013-01-20 189.835452
2013-01-27 182.152050
2013-02-03 198.564267

Let us write a function that outputs the results of an augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series is non-stationary.

In [98]:
def adf_test(timeseries):
    # Perform the augmented Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic', 'p-value', '#Lags Used',
                                'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
In [28]:
fig = px.line(df, y='sales', title='Averaged Weekly Sales over all Stores')
fig.show()

Let us try models from the ARIMA family (ARMA, ARIMA, SARIMA). First we need to split the dataset into train and test sets. We can hold out the data from 2017 for testing, while the rest is used for training.

In [29]:
df_train = df[df.index < pd.to_datetime('2017-01-01')]
df_test = df[df.index >= pd.to_datetime('2017-01-01')]

print(df_train.shape)
print(df_test.shape)
(208, 1)
(34, 1)
In [30]:
y = df_train['sales']
model = ARIMA(y, order = (20, 1, 2))
model = model.fit()
y_pred = model.get_forecast(len(df_test.index))
y_pred_df = y_pred.conf_int(alpha = 0.05) 
y_pred_df["Predictions"] = model.predict(start = y_pred_df.index[0], end = y_pred_df.index[-1])
y_pred_df.index = df_test.index
y_pred_out = y_pred_df["Predictions"] 

fig = plt.figure()
fig.set_size_inches(15, 10)
plt.plot(df_train, color='blue', label='Train')
plt.plot(df_test, color='red', label='Test')
plt.plot(y_pred_out, color='green', label = 'Predictions')
plt.legend(fontsize=20)

fig.suptitle('Time Series Forecasting of Store Sales', fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Sales', fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()
C:\Users\tejas\anaconda3\envs\coursera\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning:

No frequency information was provided, so inferred frequency W-SUN will be used.

C:\Users\tejas\anaconda3\envs\coursera\lib\site-packages\statsmodels\base\model.py:604: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals


To take this one step further, we can look at a single store and a single product family. If we can make our model robust enough for all stores and product families, we can produce predictions for the entire test set.

In [6]:
stores = train['store_nbr'].unique()
stores.sort()
print(stores,'\n')

families = train['family'].unique()
families.sort()
print(families)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54] 

['AUTOMOTIVE' 'BABY CARE' 'BEAUTY' 'BEVERAGES' 'BOOKS' 'BREAD/BAKERY'
 'CELEBRATION' 'CLEANING' 'DAIRY' 'DELI' 'EGGS' 'FROZEN FOODS' 'GROCERY I'
 'GROCERY II' 'HARDWARE' 'HOME AND KITCHEN I' 'HOME AND KITCHEN II'
 'HOME APPLIANCES' 'HOME CARE' 'LADIESWEAR' 'LAWN AND GARDEN' 'LINGERIE'
 'LIQUOR,WINE,BEER' 'MAGAZINES' 'MEATS' 'PERSONAL CARE' 'PET SUPPLIES'
 'PLAYERS AND ELECTRONICS' 'POULTRY' 'PREPARED FOODS' 'PRODUCE'
 'SCHOOL AND OFFICE SUPPLIES' 'SEAFOOD']
In [465]:
def sales_subset(store, family, plot_sales=False):
    df = train.loc[(train['store_nbr']==store) & (train['family']==family),['date','sales']]
    df.set_index('date', inplace=True)
    df = df.asfreq('d')
    
    df.fillna(0, inplace=True)
    
    if plot_sales:
        fig = px.line(df, y='sales', title=f'Sales from Store {store} within Product Family {family}')
        fig.show()
        
    return df
In [466]:
train.loc[(train['store_nbr']==stores[0]) & (train['family']==families[0]),:]
Out[466]:
date store_nbr family sales onpromotion city state type cluster
0 2013-01-01 1 AUTOMOTIVE 0.0 0 Quito Pichincha D 13
1782 2013-01-02 1 AUTOMOTIVE 2.0 0 Quito Pichincha D 13
3564 2013-01-03 1 AUTOMOTIVE 3.0 0 Quito Pichincha D 13
5346 2013-01-04 1 AUTOMOTIVE 3.0 0 Quito Pichincha D 13
7128 2013-01-05 1 AUTOMOTIVE 5.0 0 Quito Pichincha D 13
... ... ... ... ... ... ... ... ... ...
2991978 2017-08-11 1 AUTOMOTIVE 1.0 0 Quito Pichincha D 13
2993760 2017-08-12 1 AUTOMOTIVE 6.0 0 Quito Pichincha D 13
2995542 2017-08-13 1 AUTOMOTIVE 1.0 0 Quito Pichincha D 13
2997324 2017-08-14 1 AUTOMOTIVE 1.0 0 Quito Pichincha D 13
2999106 2017-08-15 1 AUTOMOTIVE 4.0 0 Quito Pichincha D 13

1684 rows × 9 columns

In [467]:
df = sales_subset(stores[0], families[0], plot_sales=False)
In [468]:
fig, ax = plt.subplots(2,1,figsize=(15,10))
plot_acf(df, lags=60, ax=ax[0], zero=False);
plot_pacf(df, lags=60, ax=ax[1], zero=False);
In [469]:
adf_test(df)
Results of Dickey-Fuller Test:
Test Statistic                   -3.877970
p-value                           0.002203
#Lags Used                       24.000000
Number of Observations Used    1663.000000
Critical Value (1%)              -3.434288
Critical Value (5%)              -2.863280
Critical Value (10%)             -2.567696
dtype: float64

The test statistic is less than the critical values, and the p-value is less than 0.05. Thus, according to the ADF test, our time series is stationary.

In [474]:
model = SARIMAX(df, order=(1,0,1), seasonal_order=(1,1,2,7), trend='c').fit()

result = model.predict(start=pd.to_datetime('2017-01-01'), dynamic=False)
result = pd.DataFrame(result)
result.rename(columns={'predicted_mean':'pred'}, inplace=True)
result['actual'] = df['2017':]

rmse = np.sqrt(mean_squared_error(result['actual'],result['pred']))
print('RMSE:',rmse)
RMSE: 2.7111257696954127
In [458]:
result = model.predict(start=pd.to_datetime('2017-08-16'), end=pd.to_datetime('2017-08-31'), dynamic=False)
result = pd.DataFrame(result)
result.rename(columns={'predicted_mean':'pred'}, inplace=True)
result
Out[458]:
pred
2017-08-16 4.717261
2017-08-17 4.516105
2017-08-18 5.486206
2017-08-19 5.044160
2017-08-20 2.764200
2017-08-21 4.746283
2017-08-22 4.829707
2017-08-23 5.003198
2017-08-24 4.595946
2017-08-25 4.944141
2017-08-26 5.278528
2017-08-27 2.609867
2017-08-28 4.484599
2017-08-29 5.166893
2017-08-30 4.918654
2017-08-31 4.715717

Thus, we see that we can make predictions for every store and product family. For now, we can fit a generic SARIMAX model to each series, keeping track of the validation RMSE and the predicted sales. Once the full run completes, we can examine the problematic cases individually.

In [657]:
# rmse = pd.read_csv('rmse_train.csv')
# rmse.set_index(['store','family'], inplace=True)
# a = pd.read_csv('test_pred.csv')
In [7]:
total_len = len(stores)*len(families)
rmse = pd.DataFrame({'store':stores.repeat(len(families)).tolist(),
                     'family':np.concatenate([families]*len(stores),axis=0).tolist(),
                     'error':np.zeros(total_len)})
rmse.set_index(['store','family'], inplace=True)

a = test.copy()

for i in range(len(stores)):
    store = stores[i]
    for j in range(len(families)):
        family = families[j]
        df = train.loc[(train['store_nbr']==store) & (train['family']==family),['date','sales']]
        df = df[df['sales'] > 0]
        df.set_index('date', inplace=True)
        df = df.asfreq('d')
        df.fillna(0, inplace=True)
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            model = SARIMAX(df, order=(1,0,1), seasonal_order=(3,1,3,7), trend='c').fit()
        
        # Calculate rmse on validation set
        result = model.predict(start=pd.to_datetime('2017-01-01'), dynamic=False)
        result = pd.DataFrame(result)
        result['actual'] = df['2017':]
        rmse.loc[store,family] = np.sqrt(mean_squared_error(result['actual'],result['predicted_mean']))
        
        # Calculate predictions on test dataset
        result = model.predict(start=pd.to_datetime('2017-08-16'), end=pd.to_datetime('2017-08-31'), dynamic=False)
        result = pd.DataFrame(result)
        a.loc[(a['store_nbr']==store) & (a['family']==family), 'sales'] = result['predicted_mean'].values
        
    print('Store Number: ', store)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [7], in <cell line: 9>()
     14 df = df[df['sales'] > 0]
     15 df.set_index('date', inplace=True)
---> 16 df = df.asfreq('d')
     17 df.fillna(0, inplace=True)
     19 model = SARIMAX(df, order=(1,0,1), seasonal_order=(3,1,3,7), trend='c').fit()

(pandas internals elided: DataFrame.asfreq -> NDFrame.asfreq -> resample._asfreq_compat)

TypeError: <class 'pandas.core.indexes.base.Index'>
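The TypeError above arises because, for some store/family combinations, filtering to sales > 0 leaves an empty frame, whose index is a plain Index rather than a DatetimeIndex, so asfreq('d') fails. One possible guard (a sketch; `daily_series` and the toy frame are illustrative, not the notebook's code) skips such series:

```python
import pandas as pd

def daily_series(df, store, family):
    """Return the daily sales series for one store/family, or None if empty."""
    sub = df.loc[(df['store_nbr'] == store) & (df['family'] == family),
                 ['date', 'sales']]
    sub = sub[sub['sales'] > 0]
    if sub.empty:
        # asfreq('d') would raise a TypeError on the empty (non-datetime) index
        return None
    sub = sub.set_index('date').asfreq('d').fillna(0)
    return sub

# Toy frame: store 1 has sales, store 2 has only zero-sales rows
toy = pd.DataFrame({'date': pd.to_datetime(['2017-01-01', '2017-01-03',
                                            '2017-01-01']),
                    'store_nbr': [1, 1, 2],
                    'family': ['BOOKS', 'BOOKS', 'BOOKS'],
                    'sales': [2.0, 3.0, 0.0]})

print(daily_series(toy, 1, 'BOOKS').shape)  # (3, 1): Jan 1-3, gap filled with 0
print(daily_series(toy, 2, 'BOOKS'))        # None: skipped safely
```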
In [765]:
rmse.to_csv('rmse_train.csv')
a.to_csv('test_pred.csv')
In [752]:
a.sort_values(by='sales',ascending=True)
Out[752]:
Unnamed: 0 id date store_nbr family onpromotion sales
11404 11404 3012292 2017-08-22 29 LADIESWEAR 0 0.000000
20002 20002 3020890 2017-08-27 20 BOOKS 0 0.000000
23299 23299 3024187 2017-08-29 13 BABY CARE 0 0.000000
15805 15805 3016693 2017-08-24 51 SCHOOL AND OFFICE SUPPLIES 0 0.000000
5812 5812 3006700 2017-08-19 22 BOOKS 0 0.000000
... ... ... ... ... ... ... ...
20892 20892 3021780 2017-08-27 45 BEVERAGES 44 15076.754185
8460 8460 3009348 2017-08-20 46 GROCERY I 39 15110.022900
20934 20934 3021822 2017-08-27 46 GROCERY I 65 15386.823700
8493 8493 3009381 2017-08-20 47 GROCERY I 41 15525.989636
20967 20967 3021855 2017-08-27 47 GROCERY I 58 15701.724885

28512 rows × 7 columns

In [753]:
rmse.sort_values(by='error', ascending=False)
Out[753]:
error
store family
46 GROCERY I 2669.809855
45 GROCERY I 2582.597331
9 GROCERY I 2482.416193
44 BEVERAGES 2455.476134
48 GROCERY I 2329.163602
... ... ...
35 LADIESWEAR 0.000004
13 BABY CARE 0.000004
49 BABY CARE 0.000004
52 BOOKS 0.000004
13 BOOKS 0.000004

1782 rows × 1 columns

We see that some series have high RMSE values. We can now look, on a case-by-case basis, at the underlying reasons.

In [755]:
store = 46
family = 'GROCERY I'
df = train.loc[(train['store_nbr']==store) & (train['family']==family),['date','sales']]
df = df[df['sales'] > 0]
df.set_index('date', inplace=True)
df = df.asfreq('d')
df.fillna(0, inplace=True)

px.line(df, y='sales')
In [764]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    model = SARIMAX(df, order=(1,0,1), seasonal_order=(3,1,3,7), trend='c').fit()

# Calculate rmse on validation set
result = model.predict(start=pd.to_datetime('2017-01-01'), dynamic=False)
result = pd.DataFrame(result)
result['actual'] = df['2017':]
np.sqrt(mean_squared_error(result['actual'],result['predicted_mean']))
Out[764]:
2538.391271358647